Preface


Pattern recognition methods have become a common tool for analysis of Earth Observation 
multispectral image data. With the coming of the new, more complex sensors of the EOS 
system, it will be important to develop new enhancements to these tools in order that the 
full information-yielding capabilities of these new data be realized. 

It is common in the theoretical derivation of a pattern recognition algorithm to assume 
precise knowledge of the parameters of the data. However, it is usually the case in the 
application of pattern recognition methods in practice that such precise knowledge is not 
available. For example, in order to obtain optimal performance from a Bayesian classifier, 
the a priori probabilities, the multivariate distributions, and appropriate loss functions for 
each class are needed; rarely is this information available in precise form. The question thus 
arises as to how best to model such imprecise knowledge and to modify the analysis 
scheme so that algorithms perform optimally under these more realistic circumstances. This 
question is what motivated this work. 




Abstract 1 


Two essential elements needed in the process of inference and decision-making are prior 
probabilities and likelihood functions. When both of these components are known 
accurately and precisely, the Bayesian approach provides a consistent and coherent solution 
to the problems of inference and decision-making. 

In many situations, however, either one or both of the above components may not be 
known, or at least may not be known precisely. This problem of partial knowledge about 
prior probabilities and likelihood functions is addressed. There are at least two ways to 
cope with this lack of precise knowledge: 1) robust methods, and 2) interval-valued 
methods. 

First, ways of modeling imprecision and indeterminacies in prior probabilities and 
likelihood functions are examined; then how imprecision in the above components carries 
over to the posterior probabilities is examined. Finally, the problem of decision making 
with imprecise posterior probabilities and the consequences of such actions are addressed. 
Application areas where the above problems may occur are in statistical pattern recognition 
problems, for example, the problem of classification of high-dimensional multispectral 
remote sensing image data. 


1 Work reported here was supported in part by NSF Grant ECS 8507405 and NASA Grant NAGW-925 
CHAPTER 1


1.1 Introduction 

Inference is the process of observing a sample or samples and drawing information about 
certain parameters of the underlying process. There are two distinct categories to inference 
problems: some utilize prior information and others are based solely on the observation 
samples. It is taken as given that prior information should be used whenever available. To
this extent the Bayesian approach provides a sound and coherent way of combining prior
information, represented by prior probabilities, and model information, represented by
likelihood functions.

To put these matters in concrete terms, let $\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ denote the set of
parameters or states of nature; $\pi(\theta_j)$ the prior probability on $\Theta$; and
$\{p(x|\theta_j);\ \theta_j \in \Theta\}$ the set of models or likelihood functions. Then, after observing x,
the inferential statement about $\theta_i$ is provided by the posterior probability $\pi(\theta_i|x)$
defined by Bayes' formula

$$\pi(\theta_i|x) = \frac{p(x|\theta_i)\,\pi(\theta_i)}{\sum_{j=1}^{m} p(x|\theta_j)\,\pi(\theta_j)}$$   (1.1)
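As a small numerical illustration of (1.1), the following sketch computes the posterior over a finite set of states. The function name `posterior` and the numbers are assumptions chosen only for illustration; they do not come from the report.

```python
def posterior(prior, likelihood):
    """Posterior pi(theta_i | x) over a finite set of states, as in (1.1).

    prior[i]      -- pi(theta_i), prior probability of state i
    likelihood[i] -- p(x | theta_i), evaluated at the observed x
    """
    joint = [l * p for l, p in zip(likelihood, prior)]
    evidence = sum(joint)  # denominator of (1.1)
    if evidence == 0:
        raise ValueError("observation has zero probability under every state")
    return [j / evidence for j in joint]


# Hypothetical three-state problem.
prior = [0.5, 0.3, 0.2]
likelihood = [0.10, 0.40, 0.40]
post = posterior(prior, likelihood)
print(post)
```

The observation favors the second and third states, so their posterior mass grows relative to the prior while that of the first state shrinks.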


Decision-making problems are specific forms of inference problems. In decision-making
problems two other elements are added; namely, a set of actions or decisions
$\mathcal{D} = \{\delta_1, \ldots, \delta_n\}$ and a loss function $L(\theta_i, \delta_j(x))$. In many problems, the set of
decisions and the set of parameters coincide. Then the problem of decision making is one of
choosing an action from the set of actions or decisions $\mathcal{D}$ in such a way that the expected
risk or the maximum risk is minimized.


1.2 Motivation for this research

As mentioned earlier, when all the components in the process of inference or decision 
making, namely the likelihood functions and the prior probabilities, are known the 
Bayesian approach provides a consistent and coherent solution. In many real world
problems, however, the above components may not really be known, or at least may not be
known completely and precisely. For instance, in the early stages of an outbreak of a new
disease, with a small sample size, it is difficult if not impossible to obtain a precise model
for the disease epidemiology. Another example is the case of high sample dimensionality,
where the available data are rarely, if ever, adequate to lead to a precise model.

The difficulty in specifying accurate prior probabilities is also very common. Actually the 
prior probabilities are often assigned quite subjectively. The difficulty in assigning accurate 
prior probabilities is the main reason non-Bayesian partisans attack the Bayesian approach. 
One can, however, go to the other extreme of doing away altogether with the prior 
probabilities. It seems self-evident that one should use all the information available without 
being either under- or over-committing. 


1.3 Statement of the problem

The three interrelated problems to be addressed are: 1) how to describe imprecise prior 
probabilities and likelihood functions, 2) how to proceed from imprecise priors and 
likelihood functions to imprecise posterior probabilities, and 3) how to make decisions with 
imprecise posterior probabilities. 


1.4 Useful concepts and terminologies 

1.4.1 General remarks 

It is important at this point to draw the differences between various sources of uncertainties; 
namely, randomness, vagueness, indeterminacies, etc. In this work, the main concern is 
with imprecisions resulting from one's inability to specify accurate priors and conditional 
densities. Therefore, imprecisions due to "indeterminacies" are the main concern. 

An extreme case of indeterminacy is called "total ignorance". The conventional approach for
handling total ignorance (especially concerning prior probabilities) is to assign probabilities
based on the uniform distribution; i.e., if the state of nature is $\Theta = \{\theta_1, \theta_2, \ldots, \theta_m\}$ and
there is no prior information about the parameters, one may be inclined to assign

$$\pi(\theta_i) = \frac{1}{m}.$$   (1.2)

There are at least two criticisms to this method of assigning probabilities: 

1) In the case of "total ignorance", intuitively, probabilities should be assigned as

$$\pi(\theta_i) = [0,1], \quad i = 1, \ldots, m.$$

2) When the state of nature $\Theta$ is continuous (e.g., $\Theta = \mathbb{R}$), this approach gives
improper probabilities; i.e.,

$$\int_{\Theta} d\pi(\theta) = \infty$$   (1.3)

It is shown by Berger [3], that decisions based on improper distributions may give rise to 
inconsistencies (for definition, see below). 


1.4.2 Terminology and Notation 

The unknown quantity $\theta$ which affects the decision process is called the state of nature or
the parameter. Prior probabilities for $\theta_j$ are denoted $\pi(\theta_j)$. The set of possible outcomes is
the sample space and will be denoted $\mathcal{X}$. (Usually, $\mathcal{X}$ will be a subset of $\mathbb{R}^n$.) The outcome
of the experiment (i.e., the observation) will be denoted X. Often X will be a vector. The
terms "conditional densities", "model", and "likelihood functions" are used to refer to the
same quantity; i.e., $\{p(x|\theta_j);\ \theta_j \in \Theta\}$, sometimes written as $\{p_{\theta_j}(x);\ \theta_j \in \Theta\}$.
$E_\theta^X[f(x)]$ will denote the expectation (over X) of a function f(x), for a given value of $\theta$.
$L(\theta_j, \delta(x))$ will represent the loss incurred when, upon observing sample x, decision
$\delta(x)$ is made and the true state of nature is $\theta_j$.

The risk of a decision rule $\delta(x)$ is defined as

$$R(\theta, \delta) = E_\theta^X[L(\theta, \delta(x))] = \int_{\mathcal{X}} L(\theta, \delta(x))\, dP(x|\theta).$$   (1.4)

This is the expected loss, for each $\theta$, if $\delta(x)$ is used repeatedly with varying x in the
problem.
In order to decide what type of decision rule should be used, some sort of ordering
of decision rules is needed. The following definitions (Berger [3]) serve as guidelines.

DEFINITION 1.1: A decision rule $\delta_1$ is R-better than a decision rule $\delta_2$ if

$$R(\theta, \delta_1) < R(\theta, \delta_2) \quad \forall\, \theta \in \Theta.$$   (1.6)


DEFINITION 1.2: A decision rule is admissible if there exists no R-better 
decision rule. A decision rule is inadmissible if there exists an R-better decision 
rule. 

DEFINITION 1.3: The Bayes risk of a decision rule $\delta$, with respect to a prior
distribution $\pi$ on $\Theta$, is defined as

$$r(\pi, \delta) = E^{\pi}[R(\theta, \delta)] = \int_{\Theta} R(\theta, \delta)\, \pi(\theta)\, d\theta.$$   (1.7)

Two frequently used decision-making principles are: 

1) The Bayes risk principle, stated as

A decision rule $\delta_1$ is preferred to a rule $\delta_2$ if

$$r(\pi, \delta_1) < r(\pi, \delta_2).$$   (1.8)

A decision rule that minimizes $r(\pi, \delta)$ is optimal and is called a Bayes rule.

2) The minimax principle, stated as

A decision rule $\delta_1$ is preferred to a decision rule $\delta_2$ if

$$\sup_{\theta} R(\theta, \delta_1) < \sup_{\theta} R(\theta, \delta_2).$$   (1.9)

A decision rule is a minimax decision rule if it minimizes $\sup_{\theta} R(\theta, \delta)$ among all the
rules in $\mathcal{D}$.
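The two principles can be compared on a toy problem. The sketch below assumes a hypothetical two-state, two-observation problem with 0-1 loss and enumerates only non-randomized decision rules; all names and numbers are illustrative assumptions, not taken from the report.

```python
from itertools import product

states = [0, 1]
samples = [0, 1]
actions = [0, 1]
p = {0: [0.8, 0.2], 1: [0.3, 0.7]}  # p[theta][x] = p(x | theta), assumed values
prior = [0.9, 0.1]                  # pi(theta), assumed values


def loss(theta, action):
    """0-1 loss: no loss when the action names the true state."""
    return 0.0 if action == theta else 1.0


# A non-randomized rule maps each observation x to an action: rule[x].
rules = list(product(actions, repeat=len(samples)))


def risk(theta, rule):
    """R(theta, delta): expected loss over x for a fixed theta."""
    return sum(loss(theta, rule[x]) * p[theta][x] for x in samples)


def bayes_risk(rule):
    """r(pi, delta): risk averaged over the prior."""
    return sum(risk(t, rule) * prior[t] for t in states)


def max_risk(rule):
    """sup over theta of R(theta, delta)."""
    return max(risk(t, rule) for t in states)


bayes_rule = min(rules, key=bayes_risk)
minimax_rule = min(rules, key=max_risk)
print("Bayes rule:", bayes_rule, "Bayes risk:", bayes_risk(bayes_rule))
print("Minimax rule:", minimax_rule, "max risk:", max_risk(minimax_rule))
```

With this strongly skewed prior the Bayes rule ignores the observation and always chooses state 0, whereas the minimax rule (among non-randomized rules) follows the observation. The two principles therefore need not agree.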
DEFINITION 1.4: Coherency and coherent inference


The concept of coherency can best be explained in terms of betting, so let us define betting
first.

DEFINITION 1.5: A bet [46] concerning an event E is an arrangement
whereby a sum of $\alpha\beta$ is exchanged for a sum of $\alpha$ if E occurs or 0 if it does
not. The bet is said to be on or against E according as $\alpha > 0$ or $\alpha < 0$. $\beta$ is
called the betting rate and $\alpha$ the stake. Let e be the indicator of the event E and
1 the indicator of the sure event $\Omega$. Then a bet concerning E is a random
quantity of the form $\alpha(e - \beta 1)$.

Let $(\Omega, \mathcal{A})$ be a measurable space with events $E_i \in \mathcal{A}$, i = 1, 2, ..., n, and let a
real-valued set function $P(E_i)$ represent the betting rate. Then De Finetti [7] shows that only
when $P(E_i)$ is a probability, i.e., satisfies the axioms of probability, can one avoid the "sure
loss" case. Only when P is a probability function would it not be possible to select
$E_1, E_2, \ldots, E_n$ and the stakes $\alpha_1, \alpha_2, \ldots, \alpha_n$ so that a combination of bets relative to these
events, at the rates $P(E_i)$ for event $E_i$, i.e., $\sum_{i=1}^{n} \alpha_i (e_i - P(E_i) 1)$, will assure a positive
gain (of course, it would be the same thing to require that no such quantity should be
uniformly negative).

Decisions and inference based on coherent real-valued set functions, P, are called coherent 
decisions and inferences (Regazzini [33]). 
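The sure-gain argument can be illustrated for the simplest case, betting rates quoted on the events of a finite partition. The sketch below is an assumption-laden illustration (the function names and rates are hypothetical): it tries only a common stake on every event, which is enough to exploit rates that do not sum to one.

```python
def sure_gain_stake(rates, tol=1e-12):
    """Return a common stake that wins in every outcome, or None.

    rates[i] is the betting rate P(E_i) quoted on event E_i of a partition.
    Only equal stakes on all events are tried in this sketch.
    """
    total = sum(rates)
    if abs(total - 1.0) < tol:
        return None          # rates are additive on the partition
    return 1.0 if total < 1.0 else -1.0


def gains(rates, stakes):
    """Gain sum_i alpha_i (e_i - P(E_i)) for each outcome k of the partition."""
    n = len(rates)
    return [
        sum(stakes[i] * ((1.0 if i == k else 0.0) - rates[i]) for i in range(n))
        for k in range(n)
    ]


rates = [0.5, 0.3, 0.4]              # hypothetical rates summing to 1.2
stake = sure_gain_stake(rates)       # betting against every event
print(gains(rates, [stake] * len(rates)))
```

Because the quoted rates sum to more than one, betting against every event with the same stake yields the same positive gain whichever event occurs, which is exactly the incoherence that additive probabilities rule out.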
CHAPTER 2

VARIOUS APPROACHES FOR HANDLING IMPRECISION 

2.1 Introduction 

The Bayesian approach offers an elegant way of combining prior information (i.e., prior
probability $\pi$ over $\Theta$) and model information (i.e., $\{p(x|\theta);\ \theta \in \Theta\}$) to construct a
distribution p over $\Theta \times \mathcal{X}$, where p is the unique probability distribution over $\Theta \times \mathcal{X}$ that
has $\pi$ as its marginal for $\theta$ and the $p(x|\theta)$ as its conditionals given $\theta$. After observing x, the
Bayesian conditions p on x to obtain posterior probabilities for $\theta$. Decisions based on the
Bayes decision rule can be shown to be coherent.

The major criticisms to the Bayesian approach, however, are its requirements for precise 
knowledge of probability values and the subjectiveness of prior probabilities. The issue of 
prior probabilities being subjective is a philosophical one which will not be addressed here. 
One approach to make prior probabilities more objective (frequentist) is to obtain prior 
probabilities from n experts and use the weighted average of those. 

In an attempt to relax the requirement for precise probability values, several methods have 
been proposed in the literature. 

2.2 Minimum cross-entropy method 

Many authors [8,14,19,31,34,35,39-41,47] have tried to quantify available prior
information and data without overcommitting. One possible approach is the
minimum cross-entropy method. Here, the prior information about an underlying
distribution, p, and the available information I, which is usually in the form of constraints
on the moments, are combined via an operator $\circ$ to obtain the posterior probability q; that is [39]

$$q = p \circ I$$   (2.1)

Specifically, let $q^*$ be the unknown underlying probability density function and let the
available information, I, be given as

$$\int g_k(x)\, q^*(x)\, dx = c_k$$   (2.2)


where $g_k(\cdot)$ are some known functions and $c_k$ are known constants. Further, let the cross-entropy
(also known as discrimination information, directed divergence, or I-divergence)
between two probability density functions q and p be defined as

$$H[q,p] = \int q(x) \log\left[\frac{q(x)}{p(x)}\right] dx$$   (2.3)

Then a posterior probability $q(\cdot)$, whenever it exists, which minimizes the above quantity
and satisfies the obvious restriction of

$$\int q(x)\, dx = 1$$   (2.4)

is the one given by [39-41]

$$q(x) = p(x) \exp\left\{ -\lambda - \sum_k \beta_k g_k(x) \right\}$$   (2.5)

where $\beta_k$ and $\lambda$ are the Lagrangian multipliers for equations (2.2) and (2.4).
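A discrete analogue of (2.5) with a single moment constraint can be solved numerically. The sketch below is an illustration under stated assumptions: the density is replaced by a probability mass function, there is one constraint function g, and the multiplier $\beta$ is found by bisection (the normalizing multiplier $\lambda$ is absorbed into the division by the sum). The function names and the die example are hypothetical.

```python
import math


def min_cross_entropy(p, g, c, lo=-50.0, hi=50.0, iters=200):
    """Return (q, beta) with q(x) proportional to p(x) exp(-beta g(x))
    and sum_x g(x) q(x) = c, for a finite support."""

    def tilt(beta):
        w = [pi * math.exp(-beta * gi) for pi, gi in zip(p, g)]
        z = sum(w)  # plays the role of exp(lambda) in (2.5)
        return [wi / z for wi in w]

    def moment(beta):
        return sum(qi * gi for qi, gi in zip(tilt(beta), g))

    # The constrained moment is non-increasing in beta, so bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment(mid) > c:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    return tilt(beta), beta


def cross_entropy(q, p):
    """Discrete version of H[q, p] in (2.3)."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


faces = [1, 2, 3, 4, 5, 6]
p = [1.0 / 6.0] * 6                       # uniform prior over die faces
q, beta = min_cross_entropy(p, faces, 4.5)  # new information: mean is 4.5
print(q, beta, cross_entropy(q, p))
```

With a uniform prior this reduces to the maximum entropy solution: the posterior weights grow geometrically with the face value so that the mean constraint is met.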

Remarks:

1) It has been shown [40] that the only operator $\circ$ that satisfies uniqueness,
invariance, and some other axioms of consistent inference and is implemented
by means of functional analysis is the one given by the principle of minimum
cross-entropy.

2) The maximum entropy method is a special case of the minimum cross-entropy
method where there is no prior information or the prior information is uniformly
distributed.

3) Intuitively, the minimum cross-entropy method provides a posterior probability
q(x) that is the closest distribution, in the sense of H[q,p], to the prior
distribution p, yet satisfies the new information provided.

4) Even though H[q,p] is not a metric (it does not satisfy the triangle inequality), it is
a good information-theoretic measure of closeness.

5) q is closer to the unknown underlying distribution $q^*$ than is p.

The main difficulties with the minimum cross-entropy methods are [3]:

1) In many cases a solution may not exist.

2) The requirement that information I be specified as various moments could be
very restrictive.

3) A solution, when it exists, is usually in many senses non-robust.


2.3 "Sup" and "inf" approach 


Let $\Omega$ be the sample space and $\mathcal{A}$ the appropriate $\sigma$-algebra on $\Omega$. The most natural way to
incorporate imprecision (i.e., indeterminacies) in probabilities is to define a family of
probability measures $\mathcal{P}$, instead of a single probability measure P, over $(\Omega, \mathcal{A})$. This
naturally leads to upper and lower probabilities

$$P^*(A) = \sup_{P \in \mathcal{P}} P(A) \quad \forall A \in \mathcal{A}$$   (2.6)

and

$$P_*(A) = \inf_{P \in \mathcal{P}} P(A) \quad \forall A \in \mathcal{A}$$   (2.7)

True probabilities, P(A), are upper and lower bounded as

$$P_*(A) \le P(A) \le P^*(A) \quad \forall A \in \mathcal{A}.$$   (2.8)

Note that, even though every $P \in \mathcal{P}$ is a regular probability measure, $P^*$ and $P_*$ themselves
need not be additive probabilities. Depending on the structure of $\mathcal{P}$, $P^*$ and $P_*$ may be
set functions that, instead of being additive, are sub- and super-additive, respectively; such
set functions are known as Choquet capacities. Capacities will be defined rigorously in the sequel.
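For a finite family of distributions on a finite sample space, the upper and lower probabilities of (2.6) and (2.7) reduce to a maximum and a minimum. The sketch below uses a hypothetical three-point sample space and three made-up distributions to show the loss of additivity.

```python
omega = ["a", "b", "c"]
# Hypothetical family of candidate probability mass functions.
family = [
    {"a": 0.5, "b": 0.3, "c": 0.2},
    {"a": 0.2, "b": 0.5, "c": 0.3},
    {"a": 0.3, "b": 0.2, "c": 0.5},
]


def prob(P, A):
    return sum(P[w] for w in A)


def upper(A):
    """P^*(A): largest probability of A over the family, as in (2.6)."""
    return max(prob(P, A) for P in family)


def lower(A):
    """P_*(A): smallest probability of A over the family, as in (2.7)."""
    return min(prob(P, A) for P in family)


A, B = {"a"}, {"b"}
print(upper(A), upper(B), upper(A | B))  # upper envelope is sub-additive
print(lower(A), lower(B), lower(A | B))  # lower envelope is super-additive
```

Here the upper probability of the union is smaller than the sum of the upper probabilities of the disjoint parts, and the lower probability of the union is larger than the sum of the lower probabilities.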


2.4 Robust methods 


The term "robust" was first used by Box in 1953. It usually refers to the situation where 
the performance does not degrade much as the parameters (here prior probabilities and 
likelihood functions) vary from their nominal values. There are two aspects to robustness; 
i.e. robustness analysis (also known as sensitivity analysis) and robustness design. The 
terms robust and stable are used sometimes to mean the same thing. 
2.4.1 Distributionally robust approaches


Here, first a set of nominal likelihood functions (or models) and a set of nominal prior 
probabilities is specified. One could do this even when the sample size is small and there is 
not much confidence in the sample. Then define a neighborhood for the nominal model and 
a neighborhood for the nominal priors. These neighborhoods reflect our confidence (or 
lack of it) in the nominal values. Finally, design the inference or decision-making 
procedure with the fact in mind that the actual model and the actual priors could vary within 
their respective neighborhoods. One could define these neighborhoods at least in two ways: 

I) the neighborhood of a given model, 

II) neighborhoods composed of a mixture of models. 

I) The neighborhood of a given model $M_0$:

Let M be the class of all models (e.g., the class of all prior probabilities, or the class of all
likelihood functions). Let $M_0$ be the nominal model and $M_1$ be a wider class of models
including $M_0$. This idea is depicted in the following figure.



II) The neighborhood composed of a mixture of models: 

When it is difficult to justify a single neighborhood of a model, one defines neighborhoods
that are composed of a mixture of models. Graphically, this is illustrated in Fig. 2.

Specific examples of these neighborhoods for prior probabilities and likelihood functions
follow.


2.4.2 Example of neighborhoods for prior probability 

Let us assume that $\pi$, the true prior probability, belongs to a class $\Gamma$ of prior probabilities,
where $\Gamma$ is defined as follows:

1) Band model: [11]

$$\Gamma = \{ \alpha\pi : L \le \pi \le U \}$$   (2.9)

where L and U are lower and upper nonnegative bounding functions and $\alpha$ is just a
normalizing factor to make $\pi$ a probability measure.

One way to obtain lower and upper bounds is to estimate the prior probabilities and then 
find a confidence interval (limit) for the estimates; thus creating a band for priors. 

Remark: 

Strictly speaking, prior probabilities should be independent of data and should 
be provided without looking at the data. 

2) $\varepsilon$-contamination model: [34]

$$\Gamma = \{ \pi : \pi = (1-\varepsilon)\,\pi_0 + \varepsilon\,\pi_1 \}$$   (2.10)

where $\pi_0$ is the nominal prior probability, $\varepsilon$ is the degree of uncertainty in the nominal
priors, $0 \le \varepsilon < 1$, and $\pi_1$ is any unknown and completely arbitrary probability measure.
The rationale for this model [18] is as follows. Consider a Bayesian decision-maker who,
after looking at the observation x, realizes that his prior belief $\pi$ was very far off the mark.
Should he stick to it and obtain a posterior distribution nobody, not even he himself, will
believe in? Or should he cheat and change the prior? $\varepsilon$-contamination allows one to keep an
$\varepsilon$ of the prior mass in "reserve for emergencies" to cope with situations like the above.

3) Prior probabilities specified by linear inequalities: [31]

In many cases one may only be able to make statements such as: $\theta_1$ is ten times more likely
than $\theta_2$, or $\theta_1$ is less likely to occur than $\theta_2$ and $\theta_3$, etc. Such partial prior information could
be specified by a set of linear inequalities of the form

$$\Gamma = \{ \pi : A\pi \ge 0,\ \mathbf{1}^T \pi = 1,\ \pi \ge 0 \}$$   (2.11)
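Membership of a candidate prior in a class specified by linear inequalities is easy to check numerically. The sketch below assumes the class has the form "A times pi is nonnegative, pi sums to one, pi is nonnegative"; the matrix A, the helper name `in_prior_class`, and the two statements encoded are hypothetical examples in the spirit of the text.

```python
def in_prior_class(pi, A, tol=1e-9):
    """Check pi >= 0, sum(pi) = 1, and every row of A satisfying A pi >= 0."""
    nonnegative = all(p >= -tol for p in pi)
    normalized = abs(sum(pi) - 1.0) <= tol
    constraints = all(
        sum(a * p for a, p in zip(row, pi)) >= -tol for row in A
    )
    return nonnegative and normalized and constraints


# Row 1: theta_1 is at least ten times as likely as theta_2  (pi_1 - 10 pi_2 >= 0).
# Row 2: theta_1 is no more likely than theta_3              (pi_3 - pi_1 >= 0).
A = [
    [1.0, -10.0, 0.0],
    [-1.0, 0.0, 1.0],
]

print(in_prior_class([0.45, 0.03, 0.52], A))  # satisfies both statements
print(in_prior_class([0.30, 0.20, 0.50], A))  # violates the first statement
```

Each qualitative statement becomes one row of A, so adding further partial information only appends rows to the matrix.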


2.4.3 Uncertainty models for likelihood functions

In many cases, a precise model for the phenomenon under observation may not be available.
For instance, in the early stages of a new disease a precise model may not be available.
Obviously, it would not be appropriate to use a very precise model, since the
consequences of error in the assumed model may be very serious both financially and in
human terms. The two extreme approaches here are parametric approaches and
distribution-free approaches. Models lying between these two extremes are considered
below.

1) Elaborated model: [15]

Let $f(x|\theta)$ be the nominal model for data x and parameter $\theta$. Then an elaborated model (EM)
can be represented as a family of densities $\{f(x|\theta, \lambda),\ \lambda \in \Lambda\}$ with $f(x|\theta) = f(x|\theta, \lambda_0)$ for
some $\lambda_0$. Examples of this type of model are:

1.1) The exponential power family

$$f(x|\mu,\sigma,\lambda) \propto \sigma^{-1} \exp\left\{ -c(\lambda) \left| \frac{x-\mu}{\sigma} \right|^{2/(1+\lambda)} \right\}$$   (2.12)

with $\lambda \in (-1, +1]$. Here $\lambda \to -1$ corresponds to a uniform density, $\lambda = 0$ to the normal
density, and $\lambda = +1$ to the double exponential density. $\lambda$ could be considered, here, as a
measure of kurtosis.


1.2) The Huber family

$$f(x|\mu,\sigma,\lambda) \propto \sigma^{-1} \exp\left\{ -g\left( \frac{x-\mu}{\sigma} \right) \right\}$$   (2.13)

where

$$g(x) = \begin{cases} \frac{1}{2}x^2, & |x| \le \lambda \\ \lambda|x| - \frac{1}{2}\lambda^2, & |x| > \lambda \end{cases}$$   (2.14)

Notice that as $\lambda \to 0^+$ this becomes a double exponential density, and as $\lambda \to \infty$ it tends to the
normal density. For other values of $\lambda$, one obtains a normal center and exponential tails.

Notice that to proceed with the above model to the posterior probabilities one would require
knowledge of the joint prior probabilities, $\pi(\theta, \lambda)$. This point will be returned to later.


2) Band model: [20]

Conceptually, band models for likelihood functions (or conditional densities) are similar to
the band models for prior probabilities. Suppose $f_\theta(x)$ (or $f(x|\theta)$) is the density function,
with respect to some measure $\mu$ (e.g., Lebesgue measure) on the measurable space $(\mathcal{X}, \mathcal{A})$,
of a probability measure $P_\theta(x)$ (or $P(x|\theta)$). Consider the neighborhood defined as

$$\mathcal{F} = \left\{ f : f_L \le f \le f_U,\ \int_{\mathcal{X}} f\, d\mu = 1 \right\}$$   (2.15)

where $f_L$, $f_U$ are nonnegative bounding functions with $f_L$ being bounded. This model may be
useful, for instance, when density functions estimated from training data are expressed
as lying within pairs of confidence limits.


3) $\varepsilon$-contamination model: [22]

$$\mathcal{F}_i = \{ f_i : f_i = (1-\varepsilon_i)\, f_i^0 + \varepsilon_i\, h_i,\ h_i \in \mathcal{H} \}$$   (2.16)

This is also similar to the model introduced above for prior probabilities, except here $f_i$, $f_i^0$,
and $h_i$ are conditional densities. This model was first introduced by Tukey and has the
following intuitive interpretation in statistical classification problems.

The class $\theta_i$ of observations consists of two parts: the well-known, frequently observable
part that has the known density $f_i^0$, and a non-studied, rarely observable part with an
unknown density $h_i \in \mathcal{H}$. If $\theta_i$ is observed, then an observation from the first part
appears with probability $(1-\varepsilon_i)$ and from the second part with probability $\varepsilon_i$. The
$\varepsilon$-contamination model is a special case of the band model where $f_{iL} = (1-\varepsilon_i) f_i^0$ and
$f_{iU} \to \infty$.

4) Total variational model: [34]

This is another useful neighborhood defined as

$$\mathcal{P} = \{ P : |P(A) - P_0(A)| \le \varepsilon,\ \forall A \in \mathcal{A} \}$$   (2.17)

where $(\Omega, \mathcal{A})$ is the measurable space on which the probability measures are defined. In
terms of densities, it can be written as

$$\mathcal{F} = \left\{ f : \int |f(x) - f_0(x)|\, dx \le \varepsilon \right\}$$   (2.18)

Once the neighborhoods are defined, the problem of decision making is to choose an
action, from the set of possible actions (or decisions), that minimizes the maximum risk;
i.e., a minimax approach. Let us use the notation introduced earlier, except that, to make
things more explicit, the risk function will be written as

$$r(\pi, \delta) = r(\pi(\theta), \{f_\theta(x)\}, \delta(x))$$   (2.19)

Then the minimax decision rule $\delta^*$ (or actually, the $\Gamma$-minimax decision rule, since the priors
are allowed to vary too) is given by

$$\delta^*(x) = \arg\min_{\delta \in \mathcal{D}} \left( \max_{\pi \in \Gamma,\ \{f_\theta\} \in \mathcal{F}} r(\pi(\theta), \{f_\theta\}, \delta(x)) \right)$$   (2.20)
The prior probability $\pi$ and $\{f_\theta\}_{\theta \in \Theta}$ for which $\delta^*$ is attained are called the least favorable
distributions and will be denoted $(\pi^L(\theta), \{f_\theta\}^L)$. Note that $\delta^*$ and $(\pi^L(\theta), \{f_\theta\}^L)$ satisfy

$$r(\pi(\theta), \{f_\theta\}, \delta^*(x)) \le r(\pi^L(\theta), \{f_\theta\}^L, \delta^*(x)) \le r(\pi^L(\theta), \{f_\theta\}^L, \delta(x))$$   (2.21)

It is important to note here that, even though it is conceptually simple to formulate the above
minimax approach, obtaining solutions (i.e., $\pi^L(\theta)$, $\{f_\theta\}^L$, $\delta^*$) may not be so simple.
Solutions for minimax (but not $\Gamma$-minimax) problems have already been found for certain
types of neighborhoods such as $\varepsilon$-contamination, band models and total variational
neighborhoods. These neighborhoods all have one thing in common: they can all be
specified as $\mathcal{P}$, where $\mathcal{P}$ is a family of distributions defined over the measurable space
$(\Omega, \mathcal{A})$ as

$$\mathcal{P} = \{ P \in \mathcal{M} : P(A) \le v(A),\ \forall A \in \mathcal{A} \}$$   (2.22)

where $\mathcal{M}$ is the set of all probability measures defined over the measurable space $(\Omega, \mathcal{A})$.
$\mathcal{P}$ is said to be the set of probabilities majorized by v.

For an $\varepsilon$-contamination neighborhood, v(A) is defined as

$$v(A) = (1-\varepsilon) P_0(A) + \varepsilon, \quad A \ne \emptyset$$   (2.23)

For a total-variational neighborhood, v(A) is defined as

$$v(A) = \min( P_0(A) + \varepsilon,\ 1 ), \quad A \ne \emptyset$$   (2.24)

The set functions v defined above have an interesting property; namely, they satisfy
the following properties [5]:

p1) $v(\emptyset) = 0$, $v(\Omega) = 1$

p2) $A \subset B \Rightarrow v(A) \le v(B)$

p3) $A_n \uparrow A \Rightarrow v(A_n) \uparrow v(A)$

p4) $F_n \downarrow F$, $F_n$ closed $\Rightarrow v(F_n) \downarrow v(F)$

p5) $v(A \cup B) \le v(A) + v(B) - v(A \cap B)$.
Any set function that satisfies p1)-p4) is called a Choquet capacity, or capacity for short. If
it also satisfies p5), then it is called alternating of order 2 or, for short, a 2-alternating
capacity. A set function u that satisfies p1)-p4) and, instead of p5), satisfies

p6) $u(A \cup B) \ge u(A) + u(B) - u(A \cap B)$

is called a 2-monotone capacity. More generally, consider the successive differences
defined [29] as

$$\nabla_1(B; B_1)\,v = v(B) - v(B \cup B_1)$$   (2.25)

$$\nabla_{n+1}(B; B_1, \ldots, B_{n+1})\,v = \nabla_n(B; B_1, \ldots, B_n)\,v - \nabla_n(B \cup B_{n+1}; B_1, \ldots, B_n)\,v$$   (2.26)

If $\nabla_k \le 0$ for k = 1,...,n, then v is called an n-alternating capacity; if $\nabla_n \le 0$ for all n, it is
called an infinite alternating capacity. Similarly, let

$$\Delta_1(B; B_1)\,u = u(B) - u(B \cap B_1)$$   (2.27)

$$\Delta_{n+1}(B; B_1, \ldots, B_{n+1})\,u = \Delta_n(B; B_1, \ldots, B_n)\,u - \Delta_n(B \cap B_{n+1}; B_1, \ldots, B_n)\,u$$   (2.28)

If $\Delta_k \ge 0$ for k = 1,...,n, then u is called an n-monotone capacity; if $\Delta_n \ge 0$ for all n, u is
said to be an infinite monotone capacity. Note that alternating and monotone capacities v
and u satisfy

$$v(A) + u(A^c) = 1$$   (2.29)

and are said to be conjugates.
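On a finite sample space these properties can be verified by brute force. The sketch below assumes a hypothetical three-point space, a made-up nominal distribution, and the $\varepsilon$-contamination capacity $v(A) = (1-\varepsilon)P_0(A) + \varepsilon$ for nonempty A; the conjugate is taken as $u(A) = 1 - v(A^c)$. All names and numbers are illustrative.

```python
from itertools import combinations

omega = (0, 1, 2)
p0 = {0: 0.5, 1: 0.3, 2: 0.2}  # hypothetical nominal distribution P_0
eps = 0.1
full = frozenset(omega)
subsets = [
    frozenset(c) for r in range(len(omega) + 1) for c in combinations(omega, r)
]


def v(A):
    """Upper capacity of the eps-contamination neighborhood; v(empty) = 0."""
    if not A:
        return 0.0
    return (1.0 - eps) * sum(p0[w] for w in A) + eps


def u(A):
    """Conjugate lower capacity, u(A) = 1 - v(complement of A)."""
    return 1.0 - v(full - A)


def is_2_alternating(f, tol=1e-12):
    """Property p5) checked over every pair of subsets."""
    return all(
        f(A | B) <= f(A) + f(B) - f(A & B) + tol for A in subsets for B in subsets
    )


def is_2_monotone(f, tol=1e-12):
    """Property p6) checked over every pair of subsets."""
    return all(
        f(A | B) >= f(A) + f(B) - f(A & B) - tol for A in subsets for B in subsets
    )


print(is_2_alternating(v), is_2_monotone(u), is_2_monotone(v))
```

The upper capacity passes the 2-alternating check and its conjugate passes the 2-monotone check, while the upper capacity itself fails the 2-monotone check, so the two notions are genuinely different.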

Let us consider the simplest form of decision-making; that is, testing a null hypothesis $H_0$
versus an alternative hypothesis $H_1$. Suppose the prior probability of $H_0$ (and $H_1$) is
known and is given by $\frac{t}{1+t}$ (and $\frac{1}{1+t}$), $t \in [0, \infty]$. Furthermore, suppose the hypotheses
correspond to two imprecisely known likelihood functions that can be modeled as sets
majorized by 2-alternating capacities $v_0$ and $v_1$. That is,

$$\mathcal{P}_0 = \{ P : P \le v_0 \}$$   (2.30)

and

$$\mathcal{P}_1 = \{ P : P \le v_1 \}$$   (2.31)

Recall that this includes such models as the $\varepsilon$-contamination and the total-variational model,
etc. Thus one is testing composite hypotheses

$$H_0 : X \sim \mathcal{P}_0 \quad \text{vs.} \quad H_1 : X \sim \mathcal{P}_1$$   (2.32)


Let A be the critical region of the test; i.e., reject $\mathcal{P}_0$ if $x \in A$ is observed. Then the upper
Bayes risk of the critical region A is (Huber & Strassen [17])

$$G_t(A) = \frac{t}{t+1}\, v_0(A) + \frac{1}{t+1}\, (1 - u_1(A))$$   (2.33)

where $u_1$ is the conjugate of $v_1$. To minimize $G_t(A)$, it is enough to minimize the
2-alternating function

$$W_t(A) = t\, v_0(A) - u_1(A).$$   (2.34)

Huber and Strassen [17] state and prove the following lemma.

Lemma 1: For each $t \in [0, \infty]$ (i.e., any given priors), there is an $A_t$ such that

$$W_t(A_t) = \inf_{A} W_t(A)$$   (2.35)

Note that $A_t$ minimizes the maximum Bayes risk.

Another approach, other than this minimax approach, could be one based on translating the 
imprecision in priors and likelihood functions onto posterior probabilities obtaining a 
family of lower and upper posteriors. For the sake of simplicity, let us examine the cases of 
imprecision in priors and likelihood functions separately. 
First, let us assume that the likelihood functions, $p(x|\theta)$, are known precisely and the only
source of imprecision is due to the priors, which can be modeled by the $\varepsilon$-contamination
neighborhood

$$\Gamma = \{ \pi : \pi = (1-\varepsilon)\,\pi_0 + \varepsilon\,\pi_1 \}$$   (2.36)

Let $\pi(\theta_i|x)$ denote the posterior probability given by the Bayes rule as

$$\pi(\theta_i|x) = \frac{p(x|\theta_i)\,\pi(\theta_i)}{\sum_{\theta_j \in \Theta} p(x|\theta_j)\,\pi(\theta_j)}$$   (2.37)

and let $\pi_0(\theta_i|x)$ denote the posterior probability corresponding to the nominal prior $\pi_0$. Then
Huber [18] shows that

$$\sup_{\pi \in \Gamma} \pi(\theta_i|x) = \frac{\pi_0(\theta_i|x) + S(\theta_i)}{1 + S(\theta_i)}$$   (2.38)

and

$$\inf_{\pi \in \Gamma} \pi(\theta_i|x) = \frac{\pi_0(\theta_i|x)}{1 + \max_{j \ne i} S(\theta_j)}$$   (2.39)

where

$$S(\theta_i) = \frac{\varepsilon}{1-\varepsilon} \cdot \frac{p(x|\theta_i)}{\sum_{\theta_j \in \Theta} p(x|\theta_j)\,\pi_0(\theta_j)}$$   (2.40)
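The posterior bounds for an $\varepsilon$-contaminated prior can be checked numerically on a finite parameter set. The sketch below rests on two assumptions that should be read as such: S is taken as $\frac{\varepsilon}{1-\varepsilon}\, p(x|\theta_i) / \sum_j p(x|\theta_j)\pi_0(\theta_j)$, and the lower bound uses the largest S among the other states. Both follow from putting all of the contaminating mass on a single state, which the brute-force function below does explicitly. Function names and numbers are hypothetical.

```python
def posterior_of(prior, likelihood, i):
    """Ordinary Bayes posterior of state i."""
    evidence = sum(l * p for l, p in zip(likelihood, prior))
    return likelihood[i] * prior[i] / evidence


def posterior_bounds(prior0, likelihood, eps, i):
    """Closed-form (lower, nominal, upper) posterior of state i.

    Requires 0 <= eps < 1 and at least two states.
    """
    m0 = sum(l * p for l, p in zip(likelihood, prior0))
    nominal = likelihood[i] * prior0[i] / m0
    S = [eps * l / ((1.0 - eps) * m0) for l in likelihood]
    upper = (nominal + S[i]) / (1.0 + S[i])
    s_other = max(S[j] for j in range(len(S)) if j != i)
    lower = nominal / (1.0 + s_other)
    return lower, nominal, upper


def brute_force_bounds(prior0, likelihood, eps, i):
    """Extremes over contaminations that are point masses on one state.

    The posterior is a ratio of linear functions of the contaminating
    distribution, so its extremes occur at such point masses.
    """
    values = []
    for k in range(len(prior0)):
        prior = [
            (1.0 - eps) * p + (eps if j == k else 0.0)
            for j, p in enumerate(prior0)
        ]
        values.append(posterior_of(prior, likelihood, i))
    return min(values), max(values)


prior0 = [0.5, 0.3, 0.2]
likelihood = [0.10, 0.40, 0.40]
lower, nominal, upper = posterior_bounds(prior0, likelihood, 0.1, 0)
print(lower, nominal, upper)
print(brute_force_bounds(prior0, likelihood, 0.1, 0))
```

Even a contamination level of 0.1 turns the single nominal posterior value into an interval, and the closed-form bounds agree with the brute-force search over point-mass contaminations.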

Now suppose that both the likelihood functions and the prior probabilities are given by
$\varepsilon$-contamination models. Following Huber [18], one says that, upon observing x, the
"information" about $\theta$ is increased by the (possibly negative) amount

$$\sum_{\theta_i \in \Theta} \pi(\theta_i|x) \log \pi(\theta_i|x) - \sum_{\theta_i \in \Theta} \pi(\theta_i) \log \pi(\theta_i).$$   (2.41)

Then a family $\{p(\cdot|\theta_i)\}$ of conditional densities and a prior probability $\pi$ will be least
informative if they minimize

$$H(p, \pi) = E_x\left[ \sum_{\theta_i} \pi(\theta_i|x) \log \pi(\theta_i|x) \right] - \sum_{\theta_i} \pi(\theta_i) \log \pi(\theta_i)$$   (2.42)

$$= \sum_{x} \sum_{\theta_i} p(x|\theta_i)\,\pi(\theta_i) \log \frac{p(x|\theta_i)\,\pi(\theta_i)}{\sum_{\theta_j} p(x|\theta_j)\,\pi(\theta_j)} - \sum_{\theta_i} \pi(\theta_i) \log \pi(\theta_i)$$   (2.43)

subject to the side conditions that

   (2.44)

and

   (2.45)

Note that it was assumed that $\mathcal{X}$ is finite. A solution to this problem, except perhaps for very
trivial cases, is difficult to obtain.


2.5 Interval-valued probabilities 

The Bayesian framework of inference and decision-making requires precise probabilities and
has no provisions for imprecise knowledge. There have been many attempts [2,8,11,24,42,43,
45,46,47] to generalize classical "point-valued" probabilities to "interval-valued"
probabilities. Dempster [8-10], and later Shafer [35-38], in an attempt to generalize the
Bayesian framework, developed what is known as the Dempster-Shafer (D-S)
theory of evidence [35]. We start with an example, then point out the
major problems with the D-S theory, and finally describe a more natural extension of the usual
probability measures and of Bayes theorem cast in this new framework.

2.5.1 Dempster-Shafer theory

The basic idea can be made clear with the following (desk) example. Suppose there is a
desk with two drawers on the right side: the right top drawer (RT) and the right bottom
drawer (RB). There are three drawers on the left side: the left top drawer (LT), the left
middle drawer (LM), and the left bottom drawer (LB). Suppose a file is placed, at random,
in one of the drawers. Further suppose that the available information (evidence in the D-S
language) is given as

Prob(file is in the left side drawers) = $m_1$ = 0.5
Prob(file is in the (RT) drawer) = $m_2$ = 0.2   (2.47)

and there is no more information.

Note that the total evidence is $m_1 + m_2 = 0.7 < 1$. Shafer calls the difference (1 - 0.7 = 0.3)
the global ignorance. The global ignorance can be assigned to any of the drawers, and yet
none in particular. Then, given the above scenario, one would like to answer questions like:
what is the probability that the file is in the (LM) drawer? Obviously, the answer to this
question cannot be given by a single number. George Boole [4] was the first to realize this
point and he suggested the idea of inner and outer measures, $p_*$ and $p^*$, such that the
probability of any event, p, is bounded by $p_*$ and $p^*$ as

$$p_* \le p \le p^*$$

Then how does one compute $p_*$ and $p^*$? Shafer calls the m's the basic probability assignments
or (bpa)'s. m(A) represents the measure of belief that is committed exactly to set A and not
to any of its proper subsets. Moreover, let us denote the sample space by $\Omega$, and assume it
is finite. Let $2^\Omega$ represent the power set of $\Omega$. Then

DEFINITION 2.1: (Shafer [35])

A function $m: 2^\Omega \to [0,1]$ is called a basic probability assignment (bpa) whenever

(1) $m(\emptyset) = 0$   (2.50)

and

(2) $\sum_{A \subset \Omega} m(A) = 1$.   (2.51)

Note that

i) It is not required that $m(\Omega) = 1$;

ii) It is not required that $m(A) \le m(B)$ when $A \subset B$;

iii) There is no obvious relationship between $m(A)$ and $m(A^c)$.

Recall that m(A) reflects the measure of belief that is committed exactly to A, not the total
belief that is committed to A. To obtain the total belief committed to A, Shafer argues that
one must add to m(A) the bpa of all the proper subsets B of A. He calls this "BELIEF", or
Bel for short. That is
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Bel(A) = ^m(B). 

BcA (2.52) 

Dempster, in his original work, called these Bel's lower probabilities. More formally, a function $\mathrm{Bel}: 2^\Omega \to [0,1]$ is called a belief function over $\Omega$ if it is given by (2.52) for some bpa $m: 2^\Omega \to [0,1]$. For our earlier "desk" example:

Bel(file is in (LM) drawer) = 0.

Bel(file is in (RT) drawer) = 0.2.

It is important to note that

$$\mathrm{Bel}(A) + \mathrm{Bel}(A^c) \le 1 . \tag{2.53}$$

To see the implication of this relationship, suppose there is no evidence at all to support $A$, i.e., $\mathrm{Bel}(A) = 0$. Then (2.53) says that, in D-S theory, it is not automatically implied that $\mathrm{Bel}(A^c) = 1$; i.e., lack of belief in something does not necessitate belief in its complement.
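
As a quick check of (2.53) on the desk example, take $A$ to be the event that the file is in one of the left side drawers, and assume that the 0.3 of global ignorance is carried by $m(\Omega)$. A worked instance under that assumption:

$$\mathrm{Bel}(A) = m(\text{left}) = 0.5, \qquad \mathrm{Bel}(A^c) = m(\{\mathrm{RT}\}) = 0.2, \qquad \mathrm{Bel}(A) + \mathrm{Bel}(A^c) = 0.7 \le 1 .$$

The shortfall, $1 - 0.7 = 0.3$, is exactly the mass left on $\Omega$.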


Furthermore, the bpa that produces a given belief function can be uniquely recovered from the belief function. This inverse relation is called the Möbius inverse. For any belief function Bel, a dual function, plausibility (or "Pl" for short), is defined as

$$\mathrm{Pl}(A) = 1 - \mathrm{Bel}(A^c) . \tag{2.54}$$

In terms of the bpa $m$, plausibility can be written as

$$\mathrm{Pl}(A) = \sum_{B \cap A \ne \emptyset} m(B) . \tag{2.55}$$
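
The recovery of the bpa from a belief function can be sketched in Python. The formula used here, $m(A) = \sum_{B \subseteq A} (-1)^{|A \setminus B|}\,\mathrm{Bel}(B)$, is the standard Möbius inversion on a finite power set; it is not written out in the text. The helper names (`subsets`, `bel_from_m`, `mobius_inverse`), the dictionary representation of a bpa, and the placement of the 0.3 of global ignorance on the whole frame are illustrative assumptions.

```python
from itertools import combinations

# Desk example; putting the global ignorance (0.3) on OMEGA is an assumption.
LEFT = frozenset({"LT", "LM", "LB"})
OMEGA = frozenset({"LT", "LM", "LB", "RT", "RB"})
m = {LEFT: 0.5, frozenset({"RT"}): 0.2, OMEGA: 0.3}


def subsets(s):
    """Return all subsets of s as frozensets."""
    items = list(s)
    return [frozenset(c)
            for r in range(len(items) + 1)
            for c in combinations(items, r)]


def bel_from_m(bpa, a):
    """Belief of a: total mass of the focal sets contained in a (eq. 2.52)."""
    return sum(mass for b, mass in bpa.items() if b <= a)


def mobius_inverse(bel, omega):
    """Recover the bpa from a belief function by Moebius inversion."""
    recovered = {}
    for a in subsets(omega):
        value = sum((-1) ** len(a - b) * bel(b) for b in subsets(a))
        if abs(value) > 1e-12:  # drop sets whose mass is zero up to rounding
            recovered[a] = value
    return recovered


recovered = mobius_inverse(lambda a: bel_from_m(m, a), OMEGA)
for focal_set, mass in recovered.items():
    print(sorted(focal_set), round(mass, 6))
```

The loop enumerates every subset of the frame, so this sketch is only practical for small frames.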


Dempster called these Pl's upper probabilities. Note that

$$\mathrm{Pl}(A) + \mathrm{Pl}(A^c) \ge 1 \tag{2.56}$$

and

$$\mathrm{Pl}(A) \ge \mathrm{Bel}(A) . \tag{2.57}$$


From our earlier "desk" example:

Pl(file is in (LM) drawer) = 0.8

Pl(file is in (RT) drawer) = 0.5.

By (2.55), the first value is $m(\text{left}) + m(\Omega) = 0.5 + 0.3$ and the second is $m(\{\mathrm{RT}\}) + m(\Omega) = 0.2 + 0.3$.
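
The desk-example values of Bel and Pl can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary representation of the bpa, the function names `bel` and `pl`, and the choice to carry the 0.3 of global ignorance as mass on the whole frame are assumptions made for illustration.

```python
# Frame of discernment for the desk example and its bpa.
LEFT = frozenset({"LT", "LM", "LB"})
OMEGA = frozenset({"LT", "LM", "LB", "RT", "RB"})

# Focal sets and masses; the global ignorance (0.3) is assumed to sit on OMEGA.
m = {LEFT: 0.5, frozenset({"RT"}): 0.2, OMEGA: 0.3}


def bel(bpa, a):
    """Belief (eq. 2.52): total mass of the focal sets contained in a."""
    a = frozenset(a)
    return sum(mass for b, mass in bpa.items() if b <= a)


def pl(bpa, a):
    """Plausibility (eq. 2.55): total mass of the focal sets that meet a."""
    a = frozenset(a)
    return sum(mass for b, mass in bpa.items() if b & a)


for drawer in ("LM", "RT"):
    print(drawer, bel(m, {drawer}), pl(m, {drawer}))
```

Under these assumptions the (LM) drawer gets the interval from Bel = 0 to Pl = 0.8, and the (RT) drawer the interval from Bel = 0.2 to Pl = 0.5.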


To make the ideas of "Bel" and "Pl" clearer, let us consider the following example. Suppose we are given $m(B_1) = 0.3$, $m(B_2) = 0.4$, $m(B_3) = 0.1$, $m(\Omega) = 0.2$, and want to find the lower and upper probabilities (or Bel and Pl) of a set $A$ given in the following diagram.

[Fig. 3: diagram of the sets $B_1$, $B_2$, $B_3$ and $A$ in $\Omega$]

Then

$$\mathrm{Bel}(A) = \sum_{B_i \subseteq A} m(B_i) = m(B_2) = 0.4$$

$$\mathrm{Pl}(A) = \sum_{B_i \cap A \ne \emptyset} m(B_i) = m(B_1) + m(B_2) + m(\Omega) = 0.3 + 0.4 + 0.2 = 0.9 .$$


Shafer further argues that the class of belief functions can be characterized without reference to basic probability assignments. That is:

THEOREM 2.1: (Shafer [35])

A function $\mathrm{Bel}: 2^\Omega \to [0,1]$ is a belief function if and only if it satisfies the following:

(1) $\mathrm{Bel}(\emptyset) = 0$.

(2) $\mathrm{Bel}(\Omega) = 1$.

(3) For every positive integer $n$ and every collection $A_1, A_2, \ldots, A_n$ of subsets of $\Omega$,

$$\mathrm{Bel}(A_1 \cup \cdots \cup A_n) \ge \sum_{i} \mathrm{Bel}(A_i) - \sum_{i<j} \mathrm{Bel}(A_i \cap A_j) + \cdots + (-1)^{n+1}\, \mathrm{Bel}(A_1 \cap \cdots \cap A_n) .$$

Remark: Note that Bel functions are infinite monotone capacities.


Similarly, one can characterize plausibility functions as follows.

THEOREM 2.2:

A function $\mathrm{Pl}: 2^\Omega \to [0,1]$ is a plausibility function if and only if it satisfies the following conditions:

(1) $\mathrm{Pl}(\emptyset) = 0$.

(2) $\mathrm{Pl}(\Omega) = 1$.

(3) For every positive integer $n$ and every collection $A_1, \ldots, A_n$ of subsets of $\Omega$,

$$\mathrm{Pl}(A_1 \cap \cdots \cap A_n) \le \sum_{i} \mathrm{Pl}(A_i) - \sum_{i<j} \mathrm{Pl}(A_i \cup A_j) + \cdots + (-1)^{n+1}\, \mathrm{Pl}(A_1 \cup \cdots \cup A_n) .$$

Remark:

1) Note that Pl functions are infinite alternating capacities.

2) When $\mathrm{Bel}(A \cup B) = \mathrm{Bel}(A) + \mathrm{Bel}(B)$ whenever $A \cap B = \emptyset$, the belief function becomes the usual classical probability measure. Furthermore, one can show (Klir [23]) that a belief function, Bel, on a finite power set $2^\Omega$ is a probability measure if and only if its basic probability assignment, $m$, is given by $m(\{\omega\}) = \mathrm{Bel}(\{\omega\})$ and $m(A) = 0$ for all subsets $A$ of $\Omega$ that are not singletons.

3) A Bel function that satisfies $\mathrm{Bel}(A) = 0$ for every proper subset $A$ of $\Omega$ is called a vacuous belief function. In terms of basic probability assignments, this means $m(\Omega) = 1$ and $m(A) = 0$ for every proper subset $A$ of $\Omega$. Furthermore, the plausibility of every such nonempty $A$ is one. That is,

$$\mathrm{Bel}(A) = 0 \le \Pr(A) \le \mathrm{Pl}(A) = 1 \qquad \forall A \subset \Omega,\ A \ne \emptyset .$$

Now that we are equipped with the basic notions of D-S theory, let us see how this theory addresses two major issues: 1) combination of various sources of information (evidence), and 2) the rule of conditioning.


2.5.2 Combination of various sources of information 

First of all, D-S theory requires that sources of evidence be independent (or non-interacting). Sources of evidence in remote sensing could be, for instance, multispectral data, elevation data, slope data, etc. In medical diagnosis, the sources could be the opinions of several doctors (experts) about the same patient. D-S theory proceeds to obtain the bpa's from each source and then combines the bpa's with what is known as Dempster's orthogonal sum. More specifically, let $\Omega$ be the sample space and let $m_1$ and $m_2$ be two bpa's obtained from information sources $S_1$ and $S_2$, respectively. Then the total information obtained about $\Omega$ from the sources $S_1$ and $S_2$ is given by the new bpa $m(C)$, defined for every nonempty set $C$ as

$$(m_1 \oplus m_2)(C) = \frac{\displaystyle\sum_{A_i \cap B_j = C} m_1(A_i)\, m_2(B_j)}{1 - \displaystyle\sum_{A_i \cap B_j = \emptyset} m_1(A_i)\, m_2(B_j)} . \tag{2.58}$$


Note that the order of combination is not important. That is,

$$m_1 \oplus m_2 = m_2 \oplus m_1 , \tag{2.59}$$

i.e., Dempster's orthogonal sum is commutative. Also, if there are three independent sources specified by their bpa's $m_1$, $m_2$, and $m_3$, they can be combined by successive application of the above rule. That is,

$$m = (m_1 \oplus m_2) \oplus m_3 = m_1 \oplus (m_2 \oplus m_3) ,$$

and the order of aggregation is not important; i.e., $\oplus$ is an associative operator.


Intuitively, Dempster's orthogonal sum says that, to find the joint bpa for a set $C$, take all the sets from source $S_1$, i.e., the $A_i$'s, and all the sets from source $S_2$, i.e., the $B_j$'s, whose intersection is $C$, multiply their bpa's, and add over all such pairs. The denominator is a normalizing constant; it is required since one of the requirements for a valid bpa is that it must sum to one. Dempster's orthogonal sum is the heart of D-S theory and also the major source of controversy and criticism. The following example, originally due to Zadeh [48-50], highlights this issue. Suppose $\Theta = \{\theta_1, \theta_2, \theta_3\}$ is the sample space of outcomes, and the information available from two independent sources leads to the two bpa's given below:

            θ1      θ2      θ3
    m1      0.9     0.1     0
    m2      0       0.1     0.9

Then, upon applying Dempster's rule of combination, one obtains

                      θ1      θ2      θ3
    m = m1 ⊕ m2       0       1       0

That is, even though both sources individually reflect low belief in $\{\theta_2\}$, after combination they collectively confirm $\{\theta_2\}$! This is highly counter-intuitive, and again results from the normalization needed in Dempster's rule.
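
Dempster's orthogonal sum (2.58) and the Zadeh example can be checked numerically. The following is a minimal sketch: the function name `dempster_combine`, the dictionary-of-frozensets representation, and the decision to omit focal sets with zero mass are illustrative assumptions.

```python
def dempster_combine(m1, m2):
    """Dempster's orthogonal sum (eq. 2.58) of two bpa's {frozenset: mass}."""
    joint = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            c = a & b
            if c:
                joint[c] = joint.get(c, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass falling on the empty set
    if not joint:
        raise ValueError("the two sources are totally conflicting")
    return {c: mass / (1.0 - conflict) for c, mass in joint.items()}


t1, t2, t3 = frozenset({"theta1"}), frozenset({"theta2"}), frozenset({"theta3"})

# Zadeh's example; focal sets with zero mass are simply left out.
m1 = {t1: 0.9, t2: 0.1}
m2 = {t2: 0.1, t3: 0.9}

m12 = dempster_combine(m1, m2)
m21 = dempster_combine(m2, m1)
print({tuple(c): round(v, 6) for c, v in m12.items()})
```

Here 0.99 of the product mass is conflicting, so the remaining 0.01 on $\{\theta_2\}$ is rescaled to one, which is the behaviour criticized above.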

Walley [44], Krantz and Miyamoto [25], and Shafer [36] have tried to apply the D-S theory to the problem of statistical inference. For the sake of simplicity, suppose that the state of nature $\Theta$ is finite, i.e., $\Theta = \{\theta_1, \ldots, \theta_n\}$, and we have $k$ statistically independent observations, each specified by a standard parametric model $\{p^{(i)}_\theta;\ \theta \in \Theta\}$, for $i = 1, \ldots, k$. The $p^{(i)}_\theta$'s are probability mass functions on a sample space $X$. Each $p^{(i)}_\theta$ describes a different statistical experiment, but all are governed by the same parameter $\theta$. For the remote sensing problem, $p^{(1)}_\theta$ could be the model for multispectral scanner (MSS) data, and $p^{(2)}_\theta$ could be the model for the elevation data, etc. Each observation $x^{(i)}$, $i = 1, \ldots, k$, gives rise to a belief function, $\mathrm{Bel}^{(i)}_{x^{(i)}}(\theta)$, $i = 1, \ldots, k$, over $\Theta$. The $\mathrm{Bel}^{(i)}_{x^{(i)}}(\theta)$ are constructed depending only on the observation $x^{(i)}$ and the model values $p_{\theta_1}(x^{(i)}), \ldots, p_{\theta_n}(x^{(i)})$. Prior information also gives rise to a belief function, $\mathrm{Bel}_0(\theta)$, over $\Theta$. Then the overall belief function is constructed as

$$\mathrm{Bel}(\theta) = \mathrm{Bel}_0(\theta) \oplus \mathrm{Bel}^{(1)}_{x^{(1)}}(\theta) \oplus \mathrm{Bel}^{(2)}_{x^{(2)}}(\theta) \oplus \cdots \oplus \mathrm{Bel}^{(k)}_{x^{(k)}}(\theta), \qquad \theta \in \Theta . \tag{2.61}$$


The main conclusions (Krantz and Miyamoto [25] and Walley [44]) are that Dempster's rule is not generally suitable for combining evidence from independent statistical observations (or from otherwise statistically related observations), nor is it suitable for combining prior belief with observational evidence. Stated differently, if the Bel function is interpreted as lower betting rates, then the use of Dempster's rule to combine prior and likelihood functions can lead to a sure loss, or "Dutch book". That is, Bel cannot coherently be interpreted as lower betting rates when Dempster's rule is used to combine priors and likelihood functions.

Finally, it is also interesting to note (Shafer [36]) that in the Bayesian frame of inference, for a given prior probability distribution $\pi_0$ over $\Theta$ and a given statistical model $\{p_\theta;\ \theta \in \Theta\}$ over $X$, one can construct a unique distribution $p$ over $\Theta \times X$, unique in the sense that $p$ is the only distribution that has $\pi_0$ as its marginal for $\Theta$ and $p_\theta$ as its conditional given $\theta$. In the D-S theory, there may be many belief functions over $\Theta \times X$ having a given marginal $\mathrm{Bel}_0$ and the given conditionals $p_\theta$.


2.5.3 Conditioning rules 

An important issue in decision-making and inference is how to change our belief concerning a particular event in light of new evidence. Of course, when the available information is in the form of classical point-valued probabilities, Bayes' rule provides a natural and sound way of accomplishing this task. In the following sections, other possibilities are examined.

2.5.3.a Conditional Bel and Pl


Suppose the available information can be represented by a Shaferian belief function, Bel, and plausibility function, Pl, on the frame of discernment $\Theta$. Suppose, further, that somehow one learns that $\theta$ is restricted to $B$, $B \subset \Theta$. Then Shafer [35] suggests the following:

1) Represent this new information as a new belief function

$$\mathrm{Bel}_B(A) = \begin{cases} 1 & \text{if } B \subseteq A \\ 0 & \text{otherwise.} \end{cases} \tag{2.62}$$

2) Combine this belief function with the belief function available prior to the new information by Dempster's rule of combination to get

$$\mathrm{Bel}(A \mid B) = \frac{\mathrm{Bel}(A \cup B^c) - \mathrm{Bel}(B^c)}{1 - \mathrm{Bel}(B^c)} . \tag{2.63}$$

Since

$$\mathrm{Pl}(A) = 1 - \mathrm{Bel}(A^c) , \tag{2.64}$$

one obtains

$$\mathrm{Pl}(A \mid B) = \frac{\mathrm{Pl}(A \cap B)}{\mathrm{Pl}(B)} . \tag{2.65}$$


2.5.3.b Conditional "sup" and "inf"


Referring to section 2.3, suppose imprecision about the available information is represented by a family of additive probability distributions $\mathcal{P}$, and

$$P^*(A) = \sup_{P \in \mathcal{P}} P(A) \tag{2.66}$$

and

$$P_*(A) = \inf_{P \in \mathcal{P}} P(A) . \tag{2.67}$$

Suppose the new information implies that $\theta$ is restricted to $B$, $B \subset \Theta$. Then one natural way of revising our earlier beliefs (probabilities) is to say

$$P^*(A \mid B) = \sup_{P \in \mathcal{P}} \frac{P(A \cap B)}{P(B)} \tag{2.68}$$

and

$$P_*(A \mid B) = \inf_{P \in \mathcal{P}} \frac{P(A \cap B)}{P(B)} . \tag{2.69}$$

The following theorem is due to Huber [18] (also see Kyburg [26]).

THEOREM 2.3: (REPRESENTATION THEOREM)

Given a belief function, there exists a closed convex set of classical probability functions $\mathcal{P}_c$ defined over the atoms of $\Theta$ such that, for every subset $A$ of $\Theta$,

$$\mathrm{Bel}(A) = \inf_{P \in \mathcal{P}_c} P(A) . \tag{2.70}$$

And conversely, if $\mathcal{P}_c$ is a closed convex set of classical probability functions defined over the atoms of $\Theta$, and for every $A_1, A_2, \ldots, A_n \subset \Theta$,

$$\inf P(A_1 \cup A_2 \cup \cdots \cup A_n) \ge \sum_{i} \inf P(A_i) - \sum_{i<j} \inf P(A_i \cap A_j) + \cdots + (-1)^{n+1} \inf P(A_1 \cap A_2 \cap \cdots \cap A_n) , \tag{2.71}$$

then there exists a belief function, Bel, such that

$$\mathrm{Bel}(A) = \inf_{P \in \mathcal{P}_c} P(A) . \tag{2.72}$$

Using the above representation theorem, it can easily be shown that

$$\inf_{P \in \mathcal{P}_c} P(A \mid B) \le \mathrm{Bel}(A \mid B) \le \mathrm{Pl}(A \mid B) \le \sup_{P \in \mathcal{P}_c} P(A \mid B) . \tag{2.73}$$


That is, Shafer's rule of conditioning provides a tighter bound on the conditional values.

It is also interesting to note that $\mathrm{Bel}(A \mid B)$ and $\mathrm{Pl}(A \mid B)$ obtained from Shafer's rule of conditioning are still $\infty$-monotone and $\infty$-alternating capacities, respectively. Shafer's results are questionable, however, since they are directly based on Dempster's rule of combination.
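
Shafer's rule of conditioning, eqs. (2.63) and (2.65), can be sketched on the desk example of section 2.5.1. The function names `bel_given` and `pl_given`, the dictionary representation, and the placement of the 0.3 of global ignorance on the whole frame are assumptions made for illustration; the conditioning event (the file is on the right side) is likewise a hypothetical choice.

```python
LEFT = frozenset({"LT", "LM", "LB"})
RIGHT = frozenset({"RT", "RB"})
OMEGA = LEFT | RIGHT

# Desk-example bpa; the mass 0.3 on OMEGA is an assumed placement of ignorance.
m = {LEFT: 0.5, frozenset({"RT"}): 0.2, OMEGA: 0.3}


def bel(a):
    """Belief: total mass of the focal sets contained in a."""
    return sum(mass for b, mass in m.items() if b <= a)


def pl(a):
    """Plausibility: total mass of the focal sets that meet a."""
    return sum(mass for b, mass in m.items() if b & a)


def bel_given(a, b):
    """Conditional belief, eq. (2.63)."""
    b_c = OMEGA - b
    return (bel(a | b_c) - bel(b_c)) / (1.0 - bel(b_c))


def pl_given(a, b):
    """Conditional plausibility, eq. (2.65)."""
    return pl(a & b) / pl(b)


A = frozenset({"RT"})
print(bel_given(A, RIGHT), pl_given(A, RIGHT))
```

With these numbers, learning that the file is on the right side moves the (RT) drawer from the interval [0.2, 0.5] to roughly [0.4, 1]; the two conditional values remain linked by $\mathrm{Pl}(A \mid B) = 1 - \mathrm{Bel}(A^c \mid B)$.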


Diaconis and Zabell [12,13] recommend the following rule:

$$P_*(A \mid B) = \frac{P_*(A \cap B)}{P_*(B)} \tag{2.75}$$

and

$$P^*(A \mid B) = 1 - P_*(A^c \mid B) . \tag{2.76}$$


Again, $P_*(\cdot \mid B)$ and $P^*(\cdot \mid B)$ would still be $\infty$-monotone and $\infty$-alternating capacities.

2.5.3.c Proposed conditioning rule


Both Dempster's rule of conditioning (eq. 2.65) and Diaconis and Zabell's rule (eq. 2.75) are counter-intuitive. For instance, let us consider Diaconis and Zabell's rule. Applying the representation theorem (Theorem 2.3) to the left hand side of eq. 2.75, one can write

$$P_*(A \mid B) = \inf_{P \in \mathcal{P}} P(A \mid B) = \inf_{P \in \mathcal{P}} \frac{P(A \cap B)}{P(B)} . \tag{2.77}$$

Applying Theorem 2.3 to the right hand side of eq. 2.75, one obtains

$$\frac{P_*(A \cap B)}{P_*(B)} = \frac{\inf_{P \in \mathcal{P}} P(A \cap B)}{\inf_{P \in \mathcal{P}} P(B)} . \tag{2.78}$$

But obviously, in general,

$$\inf_{P \in \mathcal{P}} \frac{P(A \cap B)}{P(B)} \ne \frac{\inf_{P \in \mathcal{P}} P(A \cap B)}{\inf_{P \in \mathcal{P}} P(B)} . \tag{2.79}$$
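
The inequality (2.79) is easy to see on a small numerical case. The sketch below uses a hypothetical family of just two distributions on three outcomes; the numbers and names (`family`, `prob`) are made up for illustration and do not come from the text.

```python
# Two hypothetical distributions on the outcomes a, b, c.
family = [
    {"a": 0.1, "b": 0.7, "c": 0.2},
    {"a": 0.2, "b": 0.2, "c": 0.6},
]
A = {"a"}
B = {"a", "b"}


def prob(p, event):
    """Probability of an event under the point distribution p."""
    return sum(p[w] for w in event)


# Left hand side of (2.79): infimum of the conditional probabilities.
inf_of_ratio = min(prob(p, A & B) / prob(p, B) for p in family)

# Right hand side of (2.79): ratio of the two separate infima.
ratio_of_infs = (min(prob(p, A & B) for p in family)
                 / min(prob(p, B) for p in family))

print(inf_of_ratio, ratio_of_infs)
```

In this case the infimum of the ratio is about 0.125 while the ratio of the infima is about 0.25, because the numerator and the denominator attain their infima at different members of the family.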

Considering this discrepancy, the following conditioning rule is suggested:

$$P_*(A \mid B) = \frac{P_*(A \cap B)}{P^*(B)} \tag{2.80}$$

and

$$P^*(A \mid B) = 1 - P_*(A^c \mid B) . \tag{2.81}$$

Notice that our definition (eq. 2.80) differs from, for instance, eq. 2.75 in that lower conditional probabilities are computed as the ratio of lower joint probabilities to upper marginals; that is, the normalizing factor in the denominator is $P^*(B)$ instead of $P_*(B)$.

It may be shown (proof omitted here) that $P_*(A \mid B)$ and $P^*(A \mid B)$ obtained above by our rule of conditioning are also $\infty$-monotone and $\infty$-alternating capacities, respectively.
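
The proposed rule, eqs. (2.80) and (2.81), can be sketched for the case where the lower and upper probabilities come from a finite family of distributions. The family, the events, and the function names (`lower`, `upper`, `lower_cond`, `upper_cond`) are hypothetical choices made for illustration only.

```python
# Hypothetical family of two distributions on the outcomes a, b, c.
family = [
    {"a": 0.1, "b": 0.3, "c": 0.6},
    {"a": 0.3, "b": 0.5, "c": 0.2},
]
OMEGA = {"a", "b", "c"}
A = {"a"}
B = {"a", "b"}


def prob(p, event):
    """Probability of an event under the point distribution p."""
    return sum(p[w] for w in event)


def lower(event):
    """Lower probability: infimum over the family."""
    return min(prob(p, event) for p in family)


def upper(event):
    """Upper probability: supremum over the family."""
    return max(prob(p, event) for p in family)


def lower_cond(a, b):
    """Proposed lower conditional probability, eq. (2.80)."""
    return lower(a & b) / upper(b)


def upper_cond(a, b):
    """Proposed upper conditional probability, eq. (2.81)."""
    return 1.0 - lower_cond(OMEGA - a, b)


exact = [prob(p, A & B) / prob(p, B) for p in family]
print(lower_cond(A, B), upper_cond(A, B), min(exact), max(exact))
```

For this family the rule gives roughly [0.125, 0.625], which encloses the conditional probabilities 0.25 and 0.375 of the individual members. The enclosure is not special to this example: for every member $P$ of the family, $P_*(A \cap B) \le P(A \cap B)$ and $P^*(B) \ge P(B)$, so the ratio in (2.80) can never exceed $P(A \mid B)$.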
2.6 Problems to be solved


It is important to realize that the representation theorem, Theorem 2.3, states only the existence of a family of probability distributions $\mathcal{P}$. It does not, however, suggest a method of constructing $\mathcal{P}$, nor does it imply the uniqueness of $\mathcal{P}$.

Our attempt here is in two directions: 1) Try to remedy the problems, mentioned earlier, with Dempster's rule; that is, the main effort here is to construct a Bayes-like rule for capacities. Suggestions for a new rule were made above. Properties of this new rule need further investigation. 2) Try to come up with computationally simple methods of constructing $\mathcal{P}$ so that the powerful tools of Bayesian methods could be applied, even with imprecise probabilities.

CHAPTER 3


SET-VALUED MEASURES 


3.1 Introduction 

One of the major criticisms of the Bayesian approach to inference and decision-making is its requirement of precise probability values. It has been argued by many that prior probabilities are subjective and thus it would be unrealistic to assign crisp and precise values.


Two possible solutions to this problem were the distributionally robust approach and the Dempster-Shafer theory. Even though robust approaches are conceptually easy and appealing, obtaining closed form solutions is usually very difficult, except perhaps for certain types of neighborhoods. Also, the solution is really a "worst-case" type solution.

The belief (and plausibility) functions of the D-S theory, being monotone (and alternating) capacities of infinite order, are generalizations of the "classical" measure; but the theory is mainly constructed around Dempster's rule of combination. In our opinion, any theory of statistical inference which is based on Dempster's rule of combination would have serious problems and should be abandoned.

A more natural solution would be to generalize classical measure theory, so that measures, instead of taking values in $\mathbb{R}$ or $\mathbb{R}^n$, take values in subsets of $\mathbb{R}$ or $\mathbb{R}^n$, i.e., in $2^{\mathbb{R}}$ or $2^{\mathbb{R}^n}$.


3.2 Set-valued measures 

A set-valued measure was introduced by Artstein [1]. (Actually, earlier related work was done by Debreu and Schmeidler [6].) A set-valued measure (SVM) is a $\sigma$-additive set function which takes on values in the nonempty subsets of a Euclidean space. Let $(\Omega, \mathcal{A})$ be a measurable space, and $K(\mathbb{R}^n)$ be the nonempty compact subsets of $\mathbb{R}^n$. Then an SVM is defined as follows.

DEFINITION: A set-valued measure is a set function

$$\mu : \mathcal{A} \to K(\mathbb{R}^n) \tag{3.1}$$

with the following properties:

(1) $\mu(\emptyset) = \{0\}$;

(2) $\mu\Big(\bigcup_{j=1}^{\infty} A_j\Big) = \sum_{j=1}^{\infty} \mu(A_j)$ for every disjoint family $\{A_j\}$, $A_j \in \mathcal{A}$,

where the summation above is a series of compact subsets of $\mathbb{R}^n$. The sum $\sum_{j=1}^{\infty} \mu(A_j)$ of the subsets $\mu(A_j)$ consists of all the vectors $a = \sum_{j=1}^{\infty} a_j$ where the series is absolutely convergent and $a_j \in \mu(A_j)$ for $j = 1, 2, \ldots$.

The interval-valued probability measure (IVPM) $\Phi$ (see Negoita and Ralescu [28]) is a special type of SVM defined as

$$\Phi : \mathcal{A} \to K([0,1]) \tag{3.2}$$

and satisfying the properties

(1) $1 \in \Phi(\Omega)$;

(2) $\Phi\Big(\bigcup_{j=1}^{\infty} A_j\Big) = \sum_{j=1}^{\infty} \Phi(A_j)$,

where $\{A_j\}$ is a disjoint collection of events in $\mathcal{A}$ and the summation is as defined earlier.

Example 1: Suppose $\Omega = \{\omega_1, \omega_2, \omega_3\}$ and the following values are obtained (objectively or subjectively) for $\Phi$:

$\Phi(\{\omega_1\}) = [0.6, 0.7]$

$\Phi(\{\omega_2\}) = [0.1, 0.15]$

Then one necessarily gets $\Phi(\{\omega_3\}) = [a, 0.15]$, where $a \le 0.15$. Note that

$$\Phi(\Omega) = \Phi(\omega_1 \cup \omega_2 \cup \omega_3) = \sum_{i=1}^{3} \Phi(\omega_i) = [0.7 + a,\ 1] ,$$

and for $a = 0.15$, $\Phi(\Omega) = [0.85, 1]$.

Remark: Note that $\Phi(\Omega) = [0.85, 1] \ne \{1\}$ (or $[1,1]$). This is also counter-intuitive; we will return to this point shortly.
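
The set addition used in Example 1 is the Minkowski sum, which for closed intervals reduces to adding the endpoints. A minimal sketch, with the tuple representation of intervals and the name `minkowski_sum` chosen here for illustration:

```python
def minkowski_sum(*intervals):
    """Minkowski sum of closed intervals given as (low, high) tuples."""
    low = sum(interval[0] for interval in intervals)
    high = sum(interval[1] for interval in intervals)
    return (low, high)


# Example 1 with a = 0.15 for the third outcome.
phi = {
    "omega1": (0.6, 0.7),
    "omega2": (0.1, 0.15),
    "omega3": (0.15, 0.15),
}

phi_omega = minkowski_sum(*phi.values())
print(tuple(round(v, 6) for v in phi_omega))
```

The upper endpoints add to one, as required by $1 \in \Phi(\Omega)$, but the lower endpoints add only to 0.85, which is why $\Phi(\Omega)$ is an interval rather than the single point 1.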

Negoita and Ralescu [28] have shown the following results.

Result 1: The conditional probability of an event $A \in \mathcal{A}$, given $M \in \mathcal{A}$, is given by

$$\Phi(A \mid M) = \frac{1}{\sup \Phi(M)}\, \Phi(A \cap M) . \tag{3.3}$$

Result 2: $\Phi(M \mid A)$ and $\Phi(A \mid M)$ are related by

$$\Phi(M \mid A) = \frac{\sup \Phi(M)}{\sup \Phi(A)}\, \Phi(A \mid M) , \tag{3.4}$$

and, most importantly, the Bayes formula for interval-valued probability measures (IVPM) is given by the following.

THEOREM:

Let $A_1, A_2, \ldots, A_n$ form a partition of the sample space $\Omega$, and let $B \in \mathcal{A}$ be an event. Then

$$\Phi(A_i \mid B) = \frac{\sup \Phi(A_i)}{\sum_{j=1}^{n} \sup \Phi(A_j)\, \sup \Phi(B \mid A_j)}\; \Phi(B \mid A_i), \qquad i = 1, 2, \ldots, n . \tag{3.5}$$

Returning to the problem of statistical inference and decision-making, let $X$ be the sample space of outcomes (or data), and $\Theta$ be a finite parameter space, i.e., $\Theta = \{\theta_1, \ldots, \theta_n\}$. Let $\pi_0$ be an interval-valued prior probability measure on $\Theta$ and $\{\Phi_\theta(x);\ \theta \in \Theta\}$ be a family of conditional interval-valued probability measures on $X$. Then the Bayes theorem above can be restated as

$$\pi(\theta_i \mid x) = \frac{\sup \pi_0(\theta_i)}{\sum_{j=1}^{n} \sup \pi_0(\theta_j)\, \sup \Phi(x \mid \theta_j)}\; \Phi(x \mid \theta_i) . \tag{3.6}$$

Then, upon observing $x$, the inference or decision-making could be based on the interval-valued posterior probability measure $\pi(\theta_i \mid x)$.


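Example 2: Suppose $X = \{x_1, x_2\}$ and $\Theta = \{\theta_1, \theta_2\}$, and the values of $\pi_0$ and $\{\Phi_\theta;\ \theta \in \Theta\}$ are

$\pi_0(\theta_1) = [0.5, 0.6]$
$\pi_0(\theta_2) = [0.2, 0.4]$

and

$\Phi(x_1 \mid \theta_1) = [0.1, 0.3]$
$\Phi(x_2 \mid \theta_1) = [0.6, 0.7]$

$\Phi(x_1 \mid \theta_2) = [0.8, 0.9]$
$\Phi(x_2 \mid \theta_2) = [0, 0.1]$

Then,

$$\pi(\theta_1 \mid x_1) = \frac{0.6}{(0.6)(0.3) + (0.4)(0.9)}\,[0.1, 0.3] = [0.11, 0.33]$$

$$\pi(\theta_1 \mid x_2) = \frac{0.6}{(0.6)(0.7) + (0.4)(0.1)}\,[0.6, 0.7] = [0.78, 0.91]$$

$$\pi(\theta_2 \mid x_1) = \frac{0.4}{(0.6)(0.3) + (0.4)(0.9)}\,[0.8, 0.9] = [0.59, 0.67]$$

$$\pi(\theta_2 \mid x_2) = \frac{0.4}{(0.6)(0.7) + (0.4)(0.1)}\,[0, 0.1] = [0, 0.09] .$$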
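
The interval-valued Bayes formula (3.6) is simple to evaluate mechanically. The sketch below recomputes Example 2; the tuple representation of intervals, the dictionary layout, and the function name `interval_posterior` are illustrative assumptions.

```python
# Interval-valued prior and conditional measures from Example 2, as (low, high).
prior = {"theta1": (0.5, 0.6), "theta2": (0.2, 0.4)}
likelihood = {
    ("x1", "theta1"): (0.1, 0.3),
    ("x2", "theta1"): (0.6, 0.7),
    ("x1", "theta2"): (0.8, 0.9),
    ("x2", "theta2"): (0.0, 0.1),
}


def sup(interval):
    """Upper endpoint of a closed interval."""
    return interval[1]


def interval_posterior(x):
    """Interval-valued posterior for each theta given x, following eq. (3.6)."""
    denom = sum(sup(prior[t]) * sup(likelihood[(x, t)]) for t in prior)
    posterior = {}
    for t in prior:
        scale = sup(prior[t]) / denom
        low, high = likelihood[(x, t)]
        posterior[t] = (scale * low, scale * high)
    return posterior


post_x1 = interval_posterior("x1")
post_x2 = interval_posterior("x2")
for x, post in (("x1", post_x1), ("x2", post_x2)):
    for t, (low, high) in post.items():
        print(x, t, round(low, 2), round(high, 2))
```

Note that only the upper endpoints of the prior enter the formula; the lower endpoints of $\pi_0$ play no role in (3.6).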


The following definitions and theorem are due to Artstein [1] and Puri and Ralescu [32] and will be used in the sequel.

DEFINITION:

An atom of the interval-valued probability measure $\pi$ is an event $A \in \mathcal{A}$ with $\pi(A) \ne \{0\}$ and such that $A_1 \subset A$ implies $\pi(A_1) = \{0\}$ or $\pi(A \setminus A_1) = \{0\}$. An interval-valued probability with no atoms is called nonatomic.

DEFINITION:

A selection $p$ of an interval-valued probability measure $\pi$ is a vector-valued measure $p: \mathcal{A} \to \mathbb{R}^n$, such that $p(A) \in \pi(A)$ for every $A \in \mathcal{A}$.

THEOREM:

(i) If $\pi$ is a bounded, nonatomic set-valued measure, then $\pi(A)$ is convex for every $A \in \mathcal{A}$.

(ii) If $\pi$ is a bounded set-valued measure, then for every $A \in \mathcal{A}$ and $s \in \pi(A)$, there exists a selection $p$ of $\pi$ such that $p(A) = s$.

Note that interval-valued probability measures are clearly bounded.

Remark: For nonatomic interval-valued probability measures, let us denote $p_1(A) = \inf \pi(A)$ and $p_2(A) = \sup \pi(A)$.

COROLLARY:

For a nonatomic interval-valued probability measure, $p_2$ is a regular probability measure.

Proof: This follows from the convexity of $\pi$ and the requirement $1 \in \pi(\Omega)$.

A point mentioned earlier and deferred until now is that the above definition of interval-valued probability measure requires $1 \in \pi(\Omega)$ instead of $\pi(\Omega) = \{1\}$; that is, $\pi(\Omega) = [a, 1]$ where $a \le 1$. This seems counter-intuitive because one expects that $\Omega$ should happen almost surely.


There are perhaps two ways this point may be addressed:

1) Allow the possibility of $\pi(\Omega) = [a, 1]$, $a < 1$, and interpret the quantity $(1 - a)$ as the "degree of uncertainty" about the space of outcomes, $\Omega$.

2) Add the extra requirement that $\pi(\Omega) = \{1\}$. But from this requirement, plus the requirement of additivity under Minkowski set addition, it immediately follows that one may come up with the interval-valued probability of an event $A$ such that $\pi(A) = [p, q]$ and $p > q$; i.e., the set of values $\pi$ takes on may not be an ordered set.

If one should insist that $\pi$ take on values from an ordered set, in addition to the requirement $\pi(\Omega) = \{1\}$ and the additivity property, then one should replace Minkowski addition with a different type of set addition operation.

Since the main subjects under consideration are inference and decision-making, these issues are addressed next.

CHAPTER 4


INFERENCE AND DECISION-MAKING WITH IMPRECISE 
POSTERIOR PROBABILITIES 


4.1 Introduction 

Regardless of the method used to model imprecise prior probabilities and conditional probabilities, and of how they are combined to obtain posterior probabilities, the next issue is how to proceed with these imprecise posteriors to make inferences and decisions.

In statistical inference the goal is not to make an immediate decision, but instead to provide 
a "summary" of the statistical evidence which a wide variety of future "users" of this 
evidence can easily incorporate into their own decision-making process. Posterior 
probabilities carry the required information. So as far as the statistical inference is 
concerned, once the posterior probabilities are obtained the task is completed. 

In a decision-making process, however, given an observation, prior information and the 
models (or conditional densities), rationality dictates that an action aj, from the set of 
possible actions, should be chosen that has minimum expected loss (risk). 

To be more specific, let us assume a countable parameter set $\Theta$, an action set $\mathcal{A} = \{a_1, a_2, \ldots, a_m\}$, an observation set $X$, and a loss function

$$L : \mathcal{A} \times \Theta \to \mathbb{R} \tag{4.1}$$

such that $L(a_i, \theta_j)$ is the loss incurred when action $a_i$ is selected and the state of nature (parameter) is $\theta_j$; and the set $\mathcal{D} = \{\delta_1, \delta_2, \ldots\}$ of nonrandomized decision functions

$$\delta : X \to \mathcal{A} . \tag{4.2}$$

Note that in many applications (e.g., estimation problems) $\mathcal{A} = \Theta$. Furthermore, let us represent the "posterior" upper and lower "probabilities" obtained from the combination of imprecise priors and an imprecise model by $\{P^*_x(\theta_i)$ and $P_{*x}(\theta_i);\ \theta_i \in \Theta,\ x \in X\}$. We put "posterior" and "probabilities" in quotation marks because these upper and lower quantities may not be posteriors in the Bayesian sense, and most likely would not be probabilities in the classical probability sense; at best, they may be $\infty$-alternating and $\infty$-monotone capacities. The question is:

Given $\{P^*_x(\theta_i)$ and $P_{*x}(\theta_i);\ x \in X\}$, how does one compute expected losses?


4.2 How should upper and lower expectations be defined? 

Without loss of generality, assume the loss function is a positive function. Then a natural way to define upper and lower expected loss is to define them (analogous to classical probability) as

$$E^* L(a_i, \theta) = \sum_{\{k:\ (\exists \theta)\ k = L(a_i, \theta)\}} k \cdot P^*_x\{\theta' : L(a_i, \theta') = k\} \tag{4.3}$$

and

$$E_* L(a_i, \theta) = \sum_{\{k:\ (\exists \theta)\ k = L(a_i, \theta)\}} k \cdot P_{*x}\{\theta' : L(a_i, \theta') = k\} . \tag{4.4}$$

Note that $E^*$ [and $E_*$] would be 2 (or higher order) alternating [and monotone] capacities if $P^*$ [and $P_*$] are 2 (or higher order) capacities.

Wolfenson and Fine [47], following Dempster, define the upper and lower expectations as

$$E^* L(a_i, \theta) \triangleq \overline{L}(a_i) = \sum_{\{k:\ (\exists \theta)\ k = L(a_i, \theta)\}} k \cdot \big[ P_{*x}(\{\theta' : L(a_i, \theta') \le k\}) - P_{*x}(\{\theta' : L(a_i, \theta') < k\}) \big] \tag{4.5}$$

and

$$E_* L(a_i, \theta) \triangleq \underline{L}(a_i) = \sum_{\{k:\ (\exists \theta)\ k = L(a_i, \theta)\}} k \cdot \big[ P^*_x(\{\theta' : L(a_i, \theta') \le k\}) - P^*_x(\{\theta' : L(a_i, \theta') < k\}) \big] . \tag{4.6}$$

When $P^*$ [and $P_*$] are 2-alternating [and 2-monotone] capacities, $E^*$ [and $E_*$] have, among others, the following properties:

1) $(\forall Z)\ E^* Z \ge E_* Z$

2) $E^*(Z) = -E_*(-Z)$; i.e., $E^*$ and $E_*$ are conjugates.
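
The difference between the two pairs of definitions can be seen on a small case. The sketch below takes a hypothetical bpa on three states, uses its Bel and Pl as the lower and upper probabilities, and evaluates both the sums over level sets $\{L = k\}$ of (4.3) and (4.4) and the distribution-function form of (4.5) and (4.6). All numbers and names (`m`, `loss`, `naive`, `distribution_based`) are illustrative assumptions, not values from the text.

```python
THETA = frozenset({"t1", "t2", "t3"})

# Hypothetical bpa; Bel plays the role of P_* and Pl the role of P^*.
m = {frozenset({"t1"}): 0.5, frozenset({"t2", "t3"}): 0.3, THETA: 0.2}
loss = {"t1": 1.0, "t2": 2.0, "t3": 4.0}  # L(a_i, theta) for one fixed action


def bel(event):
    """Lower probability: mass of the focal sets contained in the event."""
    return sum(mass for b, mass in m.items() if b <= event)


def pl(event):
    """Upper probability: mass of the focal sets that meet the event."""
    return sum(mass for b, mass in m.items() if b & event)


def level_set(z, test):
    """States whose value z[t] satisfies the given test."""
    return frozenset(t for t in THETA if test(z[t]))


def naive(z, cap):
    """Sum of k * cap({z = k}), the form of eqs. (4.3) and (4.4)."""
    return sum(k * cap(level_set(z, lambda v: v == k))
               for k in set(z.values()))


def distribution_based(z, cap):
    """Sum of k * [cap({z <= k}) - cap({z < k})], as in eqs. (4.5) and (4.6)."""
    return sum(k * (cap(level_set(z, lambda v: v <= k))
                    - cap(level_set(z, lambda v: v < k)))
               for k in set(z.values()))


upper_naive, lower_naive = naive(loss, pl), naive(loss, bel)
upper_dist = distribution_based(loss, bel)  # (4.5) uses the lower probability
lower_dist = distribution_based(loss, pl)   # (4.6) uses the upper probability
print(lower_naive, upper_naive, lower_dist, upper_dist)
```

For these numbers the level-set sums give roughly [0.5, 3.7], while the distribution-function form gives the narrower [1.3, 2.5]. The same code applied to $-Z$ illustrates the conjugacy in property 2.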


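Also, if one obtains $P^*$ and $P_*$ from

$$P^*(A) = \sup_{P \in \mathcal{P}} P(A) \qquad \text{for every event } A \tag{4.7}$$

$$P_*(A) = \inf_{P \in \mathcal{P}} P(A) \qquad \text{for every event } A \tag{4.8}$$

then

$$E^*(Z) = \sup_{P \in \mathcal{P}} E_P(Z) \tag{4.9}$$

$$E_*(Z) = \inf_{P \in \mathcal{P}} E_P(Z) . \tag{4.10}$$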

Note that, above, the upper probabilities are used to compute the lower expectations and vice versa. Note also that the upper and lower expectations given by

$$E^* L(a_i, \theta) = \sum_{\{k:\ (\exists \theta)\ k = L(a_i, \theta)\}} k \cdot P^*_x(\{\theta' : L(a_i, \theta') = k\}) \tag{4.11}$$

$$E_* L(a_i, \theta) = \sum_{\{k:\ (\exists \theta)\ k = L(a_i, \theta)\}} k \cdot P_{*x}(\{\theta' : L(a_i, \theta') = k\}) \tag{4.12}$$

are different from the ones given in (4.5) and (4.6). Furthermore, since in general

$$P^*(\{\theta : L(a_i, \theta) = k\}) \ne \big[ P^*(\{\theta : L(a_i, \theta) \le k\}) - P_*(\{\theta : L(a_i, \theta) < k\}) \big] \tag{4.13}$$

$$P_*(\{\theta : L(a_i, \theta) = k\}) \ne \big[ P_*(\{\theta : L(a_i, \theta) \le k\}) - P^*(\{\theta : L(a_i, \theta) < k\}) \big] \tag{4.14}$$

using the right hand sides of (4.13) and (4.14) in (4.3) and (4.4) would result in yet different values. Which one of the upper and lower pairs of values is correct? One may have to justify one pair over the other experimentally. Thus, given an observation, regardless of which method is used to get the expected values, one obtains a pair of upper and lower expected losses. Decisions are then based on the values of these pairs.

With the usual point-valued probabilities, expected losses are also point-valued, and we choose an action that has minimum expected loss (risk). For upper and lower expected losses, however, the problem is a little more complicated.

When the upper and lower expected loss (U&L EL) intervals are non-intersecting, the choice of an action is easy. That is, we order acts by dominance: $a_1 \succ a_2$ (read $a_1$ is preferred to $a_2$) if and only if

$$\underline{L}(a_1) > \overline{L}(a_2) . \tag{4.15}$$

And for more than two actions, we choose the action $a^*$ such that

$$a^* = \arg\max_{j} \underline{L}(a_j) . \tag{4.16}$$

Here, as in Fig. 4.1, the intervals are read as expected utilities, so that larger values are preferred; for expected losses the inequalities are reversed.

When the (U&L EL) intervals overlap, however, we face the problem of indecisiveness. When $\overline{L}(a_i) > \overline{L}(a_j)$ but $\underline{L}(a_i) < \underline{L}(a_j)$ (i.e., $[\underline{L}(a_j), \overline{L}(a_j)] \subset [\underline{L}(a_i), \overline{L}(a_i)]$), that is, when the intervals are nested, it is not clear which action should be preferred and why.

What can be done, however, is to eliminate from the set of possible actions those that are not preferable. That is, suppose that for $a_k$, $k \ne i$, $k \ne j$, $k = 1, 2, \ldots, m$,

$$\overline{L}(a_k) < \underline{L}(a_i)$$

and

$$\overline{L}(a_k) < \underline{L}(a_j) .$$

Then eliminate $a_k$, $k \ne i$, $k \ne j$, $k = 1, 2, \ldots, m$, from further consideration, and try to resolve the remaining indecision between $a_i$ and $a_j$. Note also that one may face indecisiveness between $a_i$ and $a_j$ when the intervals overlap with

$$\overline{L}(a_i) > \overline{L}(a_j)$$

and

$$\underline{L}(a_i) > \underline{L}(a_j) .$$
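
The elimination step can be sketched as a filter over interval-valued actions. The sketch follows the direction of the inequalities as written above, i.e., it treats larger values as preferable; for expected losses the comparison would be reversed. The function name `undominated` and the example intervals are hypothetical, and an action is dropped here as soon as any single other action dominates it.

```python
def undominated(intervals):
    """Return the actions whose upper value is not below another's lower value.

    intervals maps an action name to a (lower, upper) pair; larger is better.
    """
    keep = []
    for k, (_, upper_k) in intervals.items():
        dominated = any(upper_k < lower_i
                        for i, (lower_i, _) in intervals.items()
                        if i != k)
        if not dominated:
            keep.append(k)
    return keep


# Hypothetical intervals: a1 and a2 are nested, a3 lies entirely below both.
intervals = {"a1": (0.5, 0.9), "a2": (0.6, 0.8), "a3": (0.1, 0.4)}
remaining = undominated(intervals)
print(remaining)
```

The filter removes `a3` but cannot choose between `a1` and `a2`, which is exactly the residual indecision discussed in the text.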


There are two possibilities at this point: 1) Claim indecisiveness and require more information (e.g., in the form of more sample data for the frequentist approach), 2) Use some ad hoc but "reasonable" approach to resolve the problem. Let us show the above situation graphically (see Fig. 4.1).

[Fig. 4.1: Four possibilities, panels (a) through (d), for actions $a_i$ and $a_j$ with overlapping expected utilities: a) $\overline{L}(a_i)$ much larger than $\overline{L}(a_j)$ but $\underline{L}(a_j)$ slightly larger than $\underline{L}(a_i)$; b), c), d) etc.]

In Fig. 4.1 above, the following is recommended:

For a) $a_i \succ a_j$. That is, $a_i$ is preferred over $a_j$.

For b) $a_i$ and $a_j$ are about equally preferable; this situation can happen in point-valued expected loss problems too, when the expected losses of two actions are equal. We say that we are indifferent between $a_i$ and $a_j$, and use a "tie-breaking" rule to decide.

For c) $a_i \succ a_j$.

For d) Again, $a_i \succ a_j$.
CHAPTER 5


5.1 Summary 

This work examined the following three issues. First, how best to describe the imprecise 
knowledge about prior probabilities and conditional densities. Second, how best to 
combine these imprecise values to get the so called posterior probabilities. And finally how 
to make decisions with imprecise posteriors. 

Various methods in the literature, such as distributionally robust approaches, Dempster-Shafer theory, and set-valued measures, were examined. It was noted that even though distributionally robust approaches offer intuitively simple ways of expressing imprecision in the available knowledge, obtaining closed form solutions for the minimax decision rules is in general very difficult, except for some special families of distributions, namely classes of distributions majorized by Choquet capacities. Also, robust methods really treat only the problem of designing against worst case situations.

In examining the Dempster-Shafer (D-S) theory, it was noted that even though D-S theory provides a reasonable method for modeling imprecision, there are at least two major problems with the theory: 1) The theory is mainly built around Dempster's rule of combination of evidence; this rule, however, has been under major criticism. Recalling that Dempster-Shafer's upper and lower probabilities (or, in D-S language, the plausibility and belief) are $\infty$-alternating and $\infty$-monotone capacities, respectively, the main thrust here should be an attempt to find a Bayes-like rule for capacities. 2) The computational complexity of Dempster's rule has been shown to be #P-complete [30]. That is, given as input a set of tables representing basic probability assignments $m_1, m_2, \ldots, m_n$ over a frame of discernment $\Theta$, and a set $A \subseteq \Theta$, the problem of computing the basic probability value $(m_1 \oplus m_2 \oplus \cdots \oplus m_n)(A)$ is #P-complete.

Interval-valued probabilities (or set-valued measures in general) start from the very beginning by assigning intervals (or sets) to each event. That is, if one is not able to assign single values to the probabilities of events, one will assign intervals (or sets) of values for the probabilities. The real-valuedness axiom of conventional probability theory is relaxed. Then, in an attempt to preserve the (countable) additivity axiom, additivity is defined in terms of set additions. The main problem here, at least with the current definition of set addition, is that one cannot simultaneously enforce the requirements that: 1) the measure of the null event be zero; 2) the additivity axiom hold; and 3) the measure of the sure event be equal to one. Therefore, the third requirement is relaxed. This is, however, quite counter-intuitive, since one could then define a new event and assign the remaining probability mass to this event.

Finally, we looked at the issue of decision making with imprecise posterior probabilities. This arises from the fact that if one starts with imprecise models and/or imprecise priors, one is bound to arrive at imprecise posteriors. The specific form of the set of posteriors at this point is irrelevant. Even though some specific situations were considered, the problem basically remains an open problem. This is because conventional decision theory (based on utility theory) assumes point-valued probabilities. Preferences on the set of actions or decisions are ordered using their expected utilities. It is this ordering property that is lost when we consider sets of imprecise probabilities.
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